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Estimation of linear, non-gaussian causal models 
in the presence of confounding latent variables 
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Abstract 

The estimation of linear causal models (also known as structural equation models) from 
data is a well-known problem which has received much attention in the past. Most pre- 
vious work has, however, made an explicit or implicit assumption of gaussianity, lim- 
iting the identifiability of the models. We have recently shown (Shimizu et al., 2005 
Hoyer et al. , 2006 ) that for non-gaussian distributions the full causal model can be esti- 



mated in the no hidden variables case. In this contribution, we discuss the estimation of 
the model when confounding latent variables are present. Although in this case uniqueness 
is no longer guaranteed, there is at most a finite set of models which can fit the data. We 
develop an algorithm for estimating this set, and describe numerical simulations which 
confirm the theoretical arguments and demonstrate the practical viability of the approach. 
Full Matlab code is provided for all simulations. 



1 Introduction 

The ultimate goal of much of the empirical sci- 
ences is the discovery of causal relations, as op- 
posed to just correlations, between variables. 
That is, one is not only interested in mak- 
ing predictions based on observations; one also 
wants to predict what will happen if one inter- 
venes and changes something in the system. 

In many cases, causal effects can be estimated 
using controlled randomized experiments. Un- 
fortunately, however, in many fields of science 
and in many studies it is not always possible to 
perform controlled experiments. Often it can be 
very costly, unethical, or even technically im- 
possible to directly control the variables whose 
causal effects we wish to learn. In such cases 
one must rely on observational studies com- 
bined with prior information and reasonable as- 
sumptions to learn causal relationships. This 
has been called the causal discovery problem 
( Pearl, 2000"! |Spirtes et al., 2000 ). 

Linear causal models, also known as struc- 



tural equation models, can be thought of as the 
simplest possible causal models for continuous- 
valued data, and they have been the target of 
much research over the past decades, see e.g. 
Pollen, 19891 |Silva et al., 20061 and references 
therein. The bulk of this research has, how- 
ever, made an implicit assumption of gaussian 
data (i.e. normally distributed data). Although 
the methods that have been developed work 
for any distributions, they have been limited 
in their estimation abilities by what can be ac- 
complished for gaussian data. Fortunately, in 
the real world, many variables are inherently 
non-gaussian, and in such a case one can ob- 
tain much stronger results than for the gaus- 
sian case. In earlier work ( Shimizu et al., 2005| 



Hoyer et al., 2006) we showed that if the vari- 



ables involved are non-normally distributed, 
yet linearly related to each other, the com- 
plete causal graph is identifiable from non- 
experimental data if no confounding hidden 
variables are present. In the present paper we 
show what can be done if this last assumption 



is violated. Although in this case uniqueness is 
no longer guaranteed, there is at most a finite 
set of models which can fit the data. 

This paper is organized as follows. First, in 
sections |2] and E3 we define the linear causal 
models that are the focus of this study. Sec- 
tion 0] describes how to estimate the causal 
model from data. Section El provides some sim- 
ulations confirming the practical viability of the 
approach. Finally, sections El and discuss fu- 
ture work and state some conclusions. 

2 Linear causal models 

Assume that we observe data generated 
by a linear, non-gaussian, acyclic model 
(i.e. a LiNGAM-process ( |Shimizu et al., 20051 
|Hoyer et al., 20061 )) but that we only observe a 
subset of the involved variables. That is, the 
process has the following properties: 

1. The full set of variables (including unob- 
served variables) X{, i = {1 . . . m} can be 
arranged in a causal order, such that no 
later variable causes any earlier variable. 
We denote such a causal order by k(i). 
That is, the generating process is recur- 
sive (Bollen, 1989 ), meaning it can be rep- 
resented by a directed acyclic graph (DAG) 
( PearT, 2000| [Spirtes et al., 2000) ) . 



2. The value assigned to each variable rcj is 
a linear function of the values already as- 
signed to the earlier variables, plus a 'dis- 
turbance' (error) term e^, and plus an op- 
tional constant term a, that is 



k(j)<k{i) 



b%j Xj ~\- Ci -\- C{. 



(1) 



3. The disturbances e« are all zero-mean 
continuous random variables with non- 
gaussian distributions of non-zero vari- 
ances, and the are independent of each 
other, i.e. p(e x , . . . ,e m ) = UiPi( e i}- 

4. The observed variables is a subset of the Xi. 
We denote the set containing the indices of 
the observed variables by J C {1, . . . , m}. 
In other words, our data set contains only 
the Xj, j € J. 



Figure ^ shows an example of such a latent 
variable LiNGAM model. 

An additional assumption not needed 
in our earlier work (Sh imizu et al., 2005 
Hoyer et al., 2006 ) but useful for the purposes 
of this paper is the requirement that the 
generating network is stable ( Pearl, 2*0 00) such 
that there is no exact canceling of effects. 
That is, if there is a directed path from Xi 
to Xj then Xi has a causal effect on Xj. This 
condition has also been termed faithfulness 
(Spir tes et al., 2000 ), and in our context im- 
plies that when multiple causal paths exist 
from one variable to another their combined 
effect does not equal exactly zero. 

Finally, we assume that we are able to observe 
a large number of data vectors x (which contain 
the observed variables Xj , j £ J), and that each 
is generated according to the above described 
process, with the same set of observed variables 
J, same causal order k(i), same coefficients bij, 
same constants Cj, and the disturbances sam- 
pled independently from the same distributions. 

The key difference to our previous work 
( Shimizu et al., 2005| 



Hoyer et al., 2006D is 



that we here allow unobserved confound- 
ing variables: hidden variables which affect 
multiple observed variables and hence are 
potentially problematic for any causal analysis. 
The key difference to other existing research 
involving linear models with latent variables, 
such as (Bollen, 19891 |Silva et alT~2~0 06), is our 
assumption of non-gaussian variables, allowing 
us to utilize methods based on higher-order 
statistics. 

3 Canonical models 

Consider the example model shown in Fig- 
ure It should be immediately clear that 
it is impossible to estimate the full generat- 
ing model from a sample of data vectors x = 
(xi, X2, x%, x§, xs) T ■ This can be seen most 
clearly by the fact that the data generating 
model contains a hidden variable (xj) with no 
observed descendants; detecting the presence of 
such a variable from our data is obviously not 
feasible since it has absolutely no effect on the 
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Figure 1: (a) An example of a latent variable LiNGAM model. The diagram corresponds to 
the following data generation process: x<i := 62, X4 := e^, X5 := e$, x§ := 2x2 + 4x4 + ee> 
x\ := 2x4 + 3x5 + ei, x 8 := 3x6 + e 8 , X3 := —8x4 — 5xi + e^, and X7 := — 7xi + 67. (Here, all the 
Cj are zero, but this is not the case in general.) The 'disturbance' variables are drawn mutually 
independently from non-gaussian distributions Pi(e{). Hidden variables are shown dashed; the 
observed data vector x consists of only the values of x±, X2, X3, X5, and X8- (b) The canonical 
model corresponding to the network in (a). This is the observationally and causally equivalent 
network where all causally irrelevant variables have been simplified away. See main text for details. 



observed data. Fortunately, the impossibility of 
detecting X7 and of estimating the strengths of 
any connections to it are not cause for concern. 
The reason for this is that the variable is not 
in any way relevant with respect to our goal: 
finding the causal relationships between the ob- 
served variables. 

Another causally irrelevant hidden variable is 
X6, which simply mediates the influence of X2 
and X4 onto X8- We have 

x 6 := 2x 2 + 4x 4 + e 6 
x 8 := 3x 6 + e 8 

leading to 

x 8 := 3(2x 2 + 4x 4 + e G ) + e 8 
= 6x 2 + 12x 4 + 3e 6 + e 8 
= 6x 2 + 12x 4 + e 8 

where we have simplified e' 8 = 3e^ + e 8 . This 
shows that we can remove the hidden variable 
xq from the model to obtain a simpler model 
in which the observed data is identical to the 
original one and, in addition, the causal impli- 
cations are the same. The resulting model is 
shown in Figure E?. Note that hidden variable 



X4 cannot be simplified away without changing 
the observed data. 

We formalize this idea of causally relevant 
versus irrelevant hidden variables using the fol- 
lowing concepts: 

Two latent variable LiNGAM models are ob- 
servationally equivalent if and only if the distri- 
bution p(x) of the observed data vector is iden- 
tical for the two models. This implies that the 
models cannot be distinguished based on obser- 
vational data alone. 

Two latent variable LiNGAM models are ob- 
servationally and causally equivalent if and only 
if they are observationally equivalent and all 
causal effects of observed variables onto other 
observed variables are identical for the two mod- 



els. In the notation of Pearl (20001 these causal 
effects are given by ^(x^ | do(xj 2 )), Ji, J2 C J, 
with J\f] J2 = 0. When these are identical for all 
choices of J\ and J2 the two models in question 
cannot be distinguished based on any observa- 
tions nor any controlled experiments. 

Finally, we define a canonical model to be any 
latent variable LiNGAM model where each la- 
tent variable is a root node (i.e. has no parents) 
and has at least two children (direct descen- 



dants). Furthermore, although different latent 
variables may have the same sets of children, 
no two latent variables exhibit exactly propor- 
tional sets of connection strenghts to the ob- 
served variables. Finally, each latent variable is 
restricted to have zero mean and unit variance. 

To derive a canonical model which is obser- 
vationally and causally equivalent to any given 
latent variable LiNGAM model, we can use the 
following algorithm: 

Algorithm A: Given a latent variable LiNGAM 
model, returns an observationally and causally 
equivalent canonical model 

1. First remove any latent variables without chil- 
dren. Iterate this rule until there are no more 
such nodes. 

2. For any connection of the form X — > Y, 
where 7 is a latent variable: (a) For all 
children Z of Y, add a direct connection 
X — > Z, the strength of the connection be- 
ing the product of the strenghts of X — > Y 
and Y — > Z. If a direct connection already 
existed, add the specified amount to the orig- 
inal strength, (b) Remove the connection 
X — ► Y. Iterate this rule until there are no 
more applicable connections. 

3. Remove any latent variable with only a single 
child, incorporating the latent variable's dis- 
turbance variable and constant into those of 
the child. Iterate this until there are no more 
such latent variables. 

4. For any pair of latent variables with exactly 
proportional sets of connection strenghts to 
the observed variables, combine these into a 
single latent variable. Iterate this until there 
are no more such pairs of latent variables. 

5. Finally, standardize the latent variables to 
have zero mean and unit variance by adjust- 
ing the connection strenghts to, and the con- 
stants of, their children. 



4 Model estimation 

In this section we show how, from data gen- 
erated by any stable latent variable LiNGAM 
model, to estimate the set of canonical models 
which are observationally equivalent to the gen- 
erating model. 

4.1 Model estimation by ICA 

In earlier work (S himizu et al., 2005 
Hoyer et al. , 2006 ) we showed that data 



generated from a LiNGAM process follows an 
independent component analysis (ICA) distri- 



bution flComon" T9941 |Hyvarinen et al., 20011 ) . 
Here, we briefly review this concept, specifically 
focusing on the effect of latent variables. 

We begin by considering the full data vec- 
tor x = {x\, . . . , x m }, which includes the latent 
variables. If we as preprocessing subtract out 
the means of the variables, then the full data 
satisfies x = Bx+e, where, because of the DAG 
assumption, B is a matrix that could be per- 
muted to strict lower triangularity if one knew 
a causal ordering k(i) of the variables. Solving 
for x one obtains x = Ae, where A = (I — B) _1 
contains the influence of the disturbance vari- 
ables onto the observed variables. Again, A 
could be permuted to lower triangularity (al- 
though not strict lower triangularity) with an 
appropriate permutation k(i). Taken together, 
the linear relationship between e and x and the 
independence and non-gaussianity of the com- 
ponents of e define the standard linear indepen- 
dent component analysis model. 

So far, this is just restating what we pointed 
out in our previous work. Now consider the ef- 
fect of hiding some of the variables. This yields 
x = Ae, where A contains just the rows of A 
corresponding to the observed variables. When 
the number of observed variables is less than 
the number of disturbance variables, A is non- 
square with more columns than rows. This is 
known as an overcomplete basis in the ICA lit- 
erature. 

Let us take a closer look at the structure in- 
herent in the 'mixing matrix' A. First, note 
that since for every latent variable LiNGAM 
model there is an observationally and causally 
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Figure 2: (a) Basic structure of the full ICA matrix A for a canonical model. (In this example, 
there are 3 hidden and 6 observed variables.) The top rows correspond to the hidden variables. 
Note that the assumption of a canonical model implies that this part of the matrix consists of an 
identity submatrix and zeros. The bottom rows correspond to the observed variables. Here, the 
shaded area represents values which, depending on the network, may be zero or non-zero. The 
white area represents entries which are zero by the DAG-assumption. (b) Since, by definition, we 
do not observe the latent variables, the information that we have is limited to the bottom part of 
the matrix shown in (a), (c) Due to the permutation indeterminancy in ICA, the observed basis 
matrix has its columns in random order, (d) Since we do not (up front) know a causal order for 
the observed variables, the observed mixing matrix A has the rows in the order of data input, not 
in a causal order. 




equivalent canonical model, we can without loss 
of generality restrict our analysis to canonical 
models. Next, arrange the full set of variables 
such that all latent variables come first (in any 
internal order) followed by all observed variables 
(in a causal order), and look at the structure of 
the full matrix A shown in Figure^. Although 
we only observe part of the full matrix, with 
randomly permuted columns and with arbitrar- 
ily permuted rows as in Figure EJl, the crucial 
point is that the observed ICA basis matrix A 
contains all of the information contained in the 
full matrix A in the sense that all the free pa- 
rameters in A are also contained in A. Thus 
there is hope that the causal model could be re- 
constructed from the observed basis matrix A. 

At this point, we note that we of course do 
not directly observe A but must infer it from the 



sample vectors. Eriksson and Koivunen (2004 1 
have recently shown that the overcomplete ICA 
model is identifiable given enough data, and 
several algorithms are available for this task, 
see e.g. ( |Moulines et al., 19971 |Attias, 19991 
Hyvarinen et al., 200ip . Thus, the remainder 
of this subsection considers the inference of the 
causal model were the exact ICA mixing ma- 
trix A known. The next subsection deals with 



the practical aspect of dealing with inevitable 
estimation errors. 

As in the standard square ICA case, iden- 
tification in the overcomplete case is only up 
to permutation and scaling of the columns of 
A. The scaling indeterminancy is not serious; 
it simply amounts to a problem of not being 
able to attribute the magnitude of the influence 
of a disturbance variable ej to its variance, the 
strength of its connection to its corresponding 
variable Xi, and in the case of hidden variables 
the average strength of the connections from 
that hidden variable to the observed variables. 
This is of no consequence since we are anyway 
never able to directly monitor the hidden vari- 
ables nor the disturbance variables, making the 
scaling simply a matter of definition. 

In contrast, the permutation indeterminancy 
is a serious problem, and in general leads to 
non-uniqueness in inferring the model: We can- 
not know which columns of A correspond to the 
hidden variables. Note, however, that this is the 
only information missing, as illustrated in Fig- 
ure |21 Thus, an upper bound for the number 
of observationally equivalent canonical models 
is the number of classifications into observed vs 
hidden. This is simply (N a + N h )\/(N \N h \), 



where N Q and denote the numbers of ob- 
served and hidden variables. For each classifi- 
cation we need to check whether this leads to a 
valid latent-variable LiNGAM model: 

Algorithm B: Given an overcomplete basis 
A (containing exact zeros) and the means of 
the observed variables, calculates all observation- 
ally equivalent canonical latent variable LiNGAM 
models compatible with the basis 

1. Nh is determined as the number of columns 
of A minus the number of rows. 

2. For each possible classification of the columns 
of A as belonging to disturbance variables of 
observed vs hidden variables: 

(a) Reorder the columns such that the ones 
selected as 'hidden variables' come first 

(b) Augment the basis by adding the unob- 
served top part of the matrix (as in Fig- 
ure^), obtaining an estimate of A. 

(c) Test whether it is possible to permute 
A (using independent row and column 
permutations) to lower-triangular form. 
If not, go to the next classification. 

(d) Divide each column of A by its diago- 
nal element, and calculate the connec- 
tion strengths B = I — A -1 

(e) Check that the found network is com- 
patible with the stability assumption. If 
not, go to the next classification. 

(f) Add the found network B to the list of 
observationally equivalent models which 
could have generated the data. 

4.2 Practical considerations 

Up to here we have assumed that the exact 
ICA basis matrix is known. In a real imple- 
mentation, of course, it can only be estimated 
from the data, inevitably leading to (small) es- 
timation errors, causing all elements of A to be 
non-zero and making Algorithm B not directly 
applicable. Furthermore, since the structure of 
the network is related to the inverse of A, small 
estimation errors will give rise to small direct ef- 
fects where none exist in the generating model. 



Solving these problems requires us to have 
an idea of the accuracy of our estimate of A. 
Here, we advocate using resampling methods 



( Efron and Tibshirani, 199 3), obtaining a set of 
estimates Aj representing our uncertainty re- 
garding the elements of the mixing matrix. This 
set can then be used as follows: First, one can 
infer which elements of A are exactly zero by 
standard statistical testing, taking as the null 
hypothesis the assumption that a given element 
is zero and rejecting it when the mean/variance 
ratio for that element is large enough (in an ab- 
solute value sense). This procedure will give the 
correct answer with probability approaching 1 
as the amount of data grows. 

Having identified the zeros of the basis ma- 
trix, we apply Algorithm B to each estimate Aj 
to yield a set of estimates for the generating 
causal model. 1 This set can, again, be utilized 
in statistical tests to prune out the small direct 
effects which result from estimation errors. 

A full description of the above ideas is out of 
the scope of this short paper. Nevertheless, they 
have been successfully implemented in our sim- 
ulations (see Section^ and the reader is invited 
to study the code package for all the details. 

5 Simulations 



We have performed extensive simulations in or- 
der to (i) verify the algorithms described, (ii) 
test the viability of the resampling approach for 
dealing with estimation errors, and (iii) provide 
a simple demonstration of learning a small hid- 
den variable LiNGAM model from data. Full 
well-documented Matlab code 2 for all of these 
experiments is available, to allow the reader 
to effortlessly replicate and verify our results, 
which unfortunately cannot be described in very 
great detail here due to lack of space. 

First, we tested the basic idea by generating 
random latent variable LiNGAM models and, 
for each model, calculating its corresponding 



x Note that this is, in general, a set of sets: One index 
corresponds to the different possible permutations of the 
columns of A, the other to the different estimates by 
resampling. 

2 at http: //www . cs .Helsinki . f i/patrik.hoyer/ 

code/lvlingam . tar . gz 



connections between observed variables 
connections to/from hidden variables 



Figure 3: (a) A randomly generated latent variable LiNGAM model, (b) The canonical model 
which is causally and observationally equivalent to the network in (a), (c) An observationally 
equivalent (but causally not equivalent) model. From the ICA basis which model (a) generates, it 
is impossible to distinguish between models (b) and (c). See main text for details. 



ICA basis matrix A and from it inferring the set 
of observationally equivalent canonical models 
using Algorithm B. This process is illustrated 
in Figure El In every case, exactly one model in 
the inferred set was causally equivalent to the 
original model. 

Second, we tested the practical viability of 
the approach by assuming that we only have 
a set of inexact estimates of the basis matrix: 
We generated random LiNGAM models and, for 
each model, calculated its corresponding ICA 
basis matrix, added noise to yield 20 differ- 
ent 'estimates' of it, and used the approach 
of Section 14.21 to infer the set of possible gen- 
erating models. In each case, one of the in- 
ferred models was close (in a causal sense) to 
the original model. When the noise becomes 
large enough to cause misclassifications of ze- 
ros, the method can fail and the resulting net- 
works are not necessarily close to the generat- 
ing one. This is analogous to the sensitivity 
of constraint-based causal discovery algorithms 
( Pearl, 2000] |Spirtes et al., 2000| ) to misclassifi- 
cations of conditional independencies in the ob- 
served data. 

Finally, we performed a small-scale demon- 
stration of learning the model from actual 
data. We used the mixture-of-gaussians frame- 



work flMoulines et al., 1997||Attias, 19990 to es- 
timate the ICA bases. To make the problem as 
simple as possible, the simulation was done on 
very small networks (3 observed and one hid- 
den variable) and we additionally assumed that 
we knew the distributions of all the disturbance 
variables. A sample of 1000 data vectors was 
used. Figure |1] shows a typical result. The algo- 
rithm here finds the correct structure of the net- 
work (in this case the network is uniquely iden- 
tifiable). This would not be possible with con- 
ventional methods as there are no conditional 
independencies among the observed variables. 

6 Future work 



Much work remains before the framework de- 
scribed here can be applied to practical data 
analysis problems. The most formidable prob- 
lem is that of reliably estimating an overcom- 
plete ICA basis matrix when the number of 
hidden variables and the distributions of the 
disturbance variables are unknown. Current 
algorithms dMoulines et al., 19971 |Attias, 1999 ) 
may work for very small dimensions but to our 
knowledge no methods are yet available for es- 
timating overcomplete ICA bases reliably and 
accurately in high dimensions. 



Fortunately, we may be able to solve rela- 
tively large problems even with current meth- 
ods. The key is that hidden variables only 
cause what might be called local overcom- 
pleteness. In a large LiNGAM model where 
each latent variable directly influences only 
a small number of observed variables, the 
data follows an independent subspace (ISA) 
model (Hyvarinen and Hoyer, 2000). In such 



a model, the data distribution can be rep- 
resented as a combination of low-dimensional 
independent subspaces. When the data fol- 
lows the ISA model, the individual subspaces 
might still be found using efficient ICA algo- 



rithms ( Hyvarinen et al., 2001 1 , and algorithms 
for overcomplete ICA would only need to be run 
inside each subspace. 

7 Conclusions 

We have recently shown how to estimate lin- 
ear causal models when the distributions in- 
volved are non-gaussian and no confound- 
ing hidden variables exist ( Shimizu et al., 2005 ; 
Hoyer et al., 2006 ). In this contribution, we 
have described the effects of confounding latent 
variables, and shown how to estimate the set of 
models consistent with the observed data. Sim- 
ulations, for which full Matlab code is available, 
confirm the viability of the approach. Further 
work is needed to develop the software into a 
practical, easy-to-use data-analysis tool. 
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